The following tutorials teach how to run clustalw2 and clustal-omega from the command line, along with a few useful AWK, Bash and Perl idioms and scripts.
This tutorial expects that the user has access to a Unix/Linux environment with the standard Bash shell installed, as well as the clustalw2 and clustal-omega multiple sequence alignment programs installed and found in the $PATH. Both programs are freely available for download from the clustal.org
Multiple sequence alignments (MSAs) are essential in most bioinformatics analyses that involve comparing homologous sequences. The exact way of computing an optimal alignment between N sequences has a computational complexity of \(O(L^N)\) for \(N\) sequences of length \(L\) making it prohibitive for even small numbers of sequences using exhaustive algorithms (Sievers et al. 2011). Therefore, all MSA algorithms implement some sort of heuristics (see Katho K. ed. 2021), such as the progressive multiple sequence alignment algorithm implemented in the Clustalw and preceding Clustal versions.
Clustal programs are used for multiple sequence alignment of DNA and protein sequences. The software and its algorithms have gone through several iterations, with \(Clustal{\Omega}\) (Omega) being the latest version, published in of 2011. It is available as standalone software, via a web interface, and through a server hosted by the European Bioinformatics Institute.
Clustal has been an important bioinformatic software, with two of its academic publications among the top 100 papers cited of all time, according to Nature in 2014.
Here we will demonstrate the use of the current command-line versions: clustalw2/clustalX and clustal-omega to perform:
Both \(Clustalw2\) and \(Clustal{\Omega}\) can read multiple sequence formats, including CLUSTAL and FASTA, and write the computed alignments in even more formats, including STOCKHOLM, NEXUS, and PHYLIP.
The following code will set up our working directories for the tutorials and fetch the required sequences and scripts.
# i) Change directory into your working directory and make the following new directory to hold the sequence data for the exercises
mkdir sesion4_alineamientos && cd sesion4_alineamientos
# ii) use wget on the command line to fetch the required files from my TIB-filoinfo GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion4_alineamientos/sequences_for_alingment.tgz
if [ -s sequences_for_alingment.tgz ]
then
if file sequences_for_alingment.tgz | grep gzip &> /dev/null
then
echo "unpacking sequences_for_aligment.tgz ..."
tar -xzf sequences_for_alingment.tgz
else
echo "ERROR: This is not a gzip file"
fi
else
echo "ERROR: sequences_for_aligment.tgz not found"
fi
# iii) Download the required scripts to your ~/bin directory
base_url=https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/refs/heads/master
if [ -d ~/bin ]; then
cd ~/bin
else
mkdir ~/bin && cd ~/bin
fi
# This is a Bash array holding the list of scripts to be downloaded
scripts=(split_fasta.pl translate_fastas.pl prot2cdnAln.pl convert_alnFormats_using_clustalw.sh convert_aln_format_batch_bp.pl fasta_toolkit.awk)
# Iterate over the array elements, fetching each script with 'wget -c' and making them executable with 'chmod 755'
for f in "${scripts[@]}"
do
wget -c "$base_url"/"$f"
[ -s "$f" ] && chmod 755 "$f"
done
# Change directory back to the previous working directory
cd -
# list the directory contents
pwd && ls
We are set to start the hands-on tutorials on running clustal from the command line.
Clustalw implements a fast version of the progressive multiple sequence alignment algorithm, featuring a fast and simple method for making “guide trees.” These are clusterings of the sequences that are used to decide the order of alignment during the later progressive alignment phase. The general idea is to start with alignments of just two sequences, usually the closest ones in the dataset. The alignment is then built up by aligning alignments with each other or sequences to alignments, according to the topology of the guide tree. The complexity of the guide tree construction is usually \(O(N^2)\) for \(N\) sequences because all \(N\) sequences have to be compared to each other. Earlier versions of Clustal used fast word based alignments for these comparisons which made it memory efficient and fast enough to work on PCs and Macintosh computers with up to a few hundred sequences (Sievers & Higgins, 2018).
Clustalw can be freely obtained from the Clustalw2 webpage. Clustal 2 comes in two flavors: the command-line version Clustal W and the graphical version Clustal X. Precompiled executables for Linux, Mac OS X and Windows (incl. XP and Vista) of the most recent version (currently 2.1).
Here we are going to use only the command-line version clustalw.
Note that the \(clustalw\) and \(clustalw2\) calls are equivalent and execute the same clustalw version 2.1 binay.
Below is a compilation of some of the most frequently used \(clustalw\) options:
#====================================================================================== # Basic clustalw usage #-------------------------------------------------------------------------------------- # explore the manuals man clustalw clustalw -help clustalw -fullhelp # standard clustal command line call clustalw -infile=fasta_no_alineado -align -output=fasta -outfile=archivo_cluwaln.fasta # clustalw call with modified parameters clustalw -infile=archivo.fasta -align -pwmatrix=blosum -pwgapopen=12 -pwgapext=0.2 -matrix=blosum -gapopen=12 -gapext=0.2 -outorder=aligned -convert -outfile=archivo_aln1.phy -output=phylip
The next lines show how to align the the provided FASTA files with GDP sequences from prokaryots and eukaryots, writing the resulting multiple sequence alignments (MSAs) in FASTA format.
# default clustalw alignments
clustalw -infile=4_GDP_procar_ualn.faa -outfile=4_GDP_procar_cluAln.faa -output=fasta
clustalw -infile=6_GDP_eucar_ualn.faa -outfile=6_GDP_eucar_cluAln.faa -output=fasta
# A profile to profile alignment
clustalw -profile -profile1=6_GDP_eucar_cluAln.faa -profile2=4_GDP_procar_cluAln.faa -outfile=10_GDP_eucar_prokar.aln
The clustal format can be conveniently visualized directly in the terminal, as shown below:
# display the 10_GDP_eucar_prokar.aln MSA written in clustal format
cat 10_GDP_eucar_prokar.aln
CLUSTAL 2.1 multiple sequence alignment
gpdhuman ---------------MASKK----------------VCIVGSGNWGSAIAKIVGGNAAQL
gpdarabit ----------------AGKK----------------VCIVGSGDWGSAIAKIVGGNAAQL
gpdamouse ----------------AGKK----------------VCIVGSGNWGSAIAKIVGSNAGRL
gpdadrome ---------------MADKVN---------------VCIVGSGNWGSAIAKIVGANAAAL
gpdacaeel ---------------MSPKK----------------VTIIGSGNWGSAIARIVGSTTKSF
gpd1yeast MSAAADRLNLTSGHLNAGRKRSSSSVSLKAAEKPFKVTVIGSGNWGTTIAKVVAENCKGY
gpdaecoli -------------MNQRNAS----------------MTVIGAGSYGTALAITLARNGHEV
gpdahaein -------------MITSQTP----------------ITVLGAGSYGTALAITFSRNGSPT
gpdpseu -----------------------------------------MTSCCGAVAKTRWRN----
gpdabacsu -----------------MKK----------------VTMLGAGSWGTALALVLTDNGNEV
. ::* .
gpdhuman A-QFDPRVTMWVFEEDIGGKKLTEIINTQHENVKYLPGHKLPPNVVAVPDVVQAAEDAD-
gpdarabit A-QFDPRVTMWVFEEDIGGKKLTEIINTHQENVKYLPGHKLPPNVVAVPDVVKAAADAD-
gpdamouse A-HFDPRVTMWVFEEDIGGRKLTEIINTQHENVKYLPGHKLPPNVVAIPDVVQAATGAD-
gpdadrome P-EFEERVTMFVYEELIDGKKLTEIINETHENVKYLKGHKLPPN-VAVPDLVEAAKNAD-
gpdacaeel PDEFDPTVRMWVFEEIVNGEKLSEVINNRHENIKYLPGKVLPNNVVAVTDLVESCEGSN-
gpd1yeast PEVFAPIVQMWVFEEEINGEKLTEIINTRHQNVKYLPGITLPDNLVANPDLIDSVKDVD-
gpdaecoli V--------LWGHDPEH-----IATLERDRCNAAFLPDVPFPDTLHLESDLATALAASR-
gpdahaein H--------LWGHNPAH-----IAQMQTERQNYRFLPDVIFPEDLHLESNLAQAMEYSQ-
gpdpseu -------------------------VPDPCENTAYLPGHPLPAALKATADFSLALDHVAQ
gpdabacsu C--------VWAHRADL-----IHQINELHENKDYLPNVKLSTSIKGTTDMKEAVSDAD-
: * :* . :. .:. :
... TRUNCATED
Note that clustal uses the following symbols below each alignment block to highlight conserved residues:
* (asterisk) positions that have a single and fully conserved residue
: (colon) conserved: conservation between groups of strongly similar properties (score > 0.5 on the PAM 250 matrix)
. (period) semi-conserved: conservation between groups of weakly similar properties (score ≤ 0.5 on the PAM 250 matrix)
(blank) non-conserved
Only the N-terminal part of the MSA is shown. Notice the long N-terminal extension of the yeast sequence. What do you think about the MSA around this extension?
Seaview is a multiplatform, graphical user interface for multiple sequence alignment and molecular phylogeny.
Seaview can be run from the command line, and the options can be found with:
# display the help menu
seaview -h
or here: seaview command manual
To visualize an alignment, simply call \(seaview\) with the alignment file name as single argument:
seaview 10_GDP_eucar_prokar.aln &
We will first attempt to perform a default MSA of nine full-length leuA sequences from diverse Bacillales using \(clustalw\)
Lets firs count the sequences, and have a look at them
# count sequences
grep -c '^>' leuA-Bacillales.fas
# display file
cat leuA-Bacillales.fas
9 >B_anthracis_Ames 30261500 atgaagcaaattttgttcatggatacgacgctacgtgatggcgaacaatcaccaggagtaaatttaaatgaacaagagaa attgcaaattgcaaggcaattagagagactagggattcatgtcatggaagctggctttgcagcagcttccgaaggggatt ttcaatcggtaaagcgcatcgcaaatacaattcaaaatgctacagttatgagtttagcaagagcgaaagaaagtgatatt cgaagagcatatgaagctgtgaaaggtgcggtatctcctcgtttacacgtatttttagctacgagtgatattcatatgaa gtataaactttgtatgtcaaaagaagatgtattagatagtatttatcgatcggttacacttgggaaatcgttatttccga cagtacaattttcagcagaagatgcaacaagaacagcaagagattttttagctgaagcggtagaagttgcgattcgtgca ggagcgaatgtaattaatattccagatacagttgggtatacgaatccagaagaatattatgctctttttaaatatttaca agaatccgttccttcatatgaaaaagcaattttctcttgtcattgtcatgatgatcttgggatggcggtagcaaattcat tagctgcagttgaaggcggagcgttgcaagtagaaggaacaattaatgggattggagaaagagcagggaatgcggcgtta gaagaggttgcagttgctcttcatattaggaaagatttctataaagcagagccttctatgacgctaaaagaaattaaagc gacaagtacattagtaagcagattaacgggtatggttgtatcgaaaaataaagcgattgttggagcaaatgcattcgctc atgaatcaggcattcatcaagatggtgtattaaaagaagtgacaacatatgaaattattgaaccagcgctagtaggcgaa tctcaaaatctatttgtacttggaaaacattctgggcgtcacgcgtttacagaaaaaatgaaagaacttggctatgaatt tacaaacgacgagcgggatgcagtatttgaagcatttaaaaaattagctgatcgtaagaaggaaattactgaggaagatt tacgagcacttatgcttggtgaagcagcatttgcggctcagcaatataacattacgcaattgcaagtacatttcgtatca aatagtacacaatgtgcgacagttgtactaaaagatgaggaaggaaatgtattcgaagatgcagctactggctctggtag tattgaagcgatttataatgcaatccaaagaattttaggattggaatgtgaattggcggattatcgcatacaatctatta cacaaggtcaagatgcactggctcatgttcatgttgaattaaaagaaggagcacatcaagtatcaggttttggtgttgca caagacgtattggaagcgtcggcaagagcatatgtccacgcagcggggaaattgaaatcttttatccagcttgtgaaata a ... truncated
Align them under default parameters:
# generate the MSA
clustalw -infile=leuA-Bacillales.fas -outfile=leuA-Bacillales.aln
# display the MSA
cat leuA-Bacillales.aln
... truncated
B_anthracis_Ames TGTTGAATTAAAAGAA--GGA-GCACATCAAGTATCAGGTTTTGGTGTTG
B_thuringiensis_konkukian TGTTGAATTAAAAGAA--GGA-GCACATCAAGTATCAGGTTTTGGTGTTG
B_cereus_ATCC14579 TGTTGAATTAAAAGAA--GGT-AGTCACCAAATATCAGGTTTCGGAGTTG
B_licheniformis_ATCC_14580 CGTCAGGGTG-ATGAT--CGATGGAAAAGAATCGGGAGGACGCGGGGTTG
B_subtilis TGTAAGAGTT-TTGGT--GAACGGAAAAGAATCAGCAGGGCGGGGCATAG
B_halodurans TGTTCAAATGGATTAT--CAA-GGAGTCGTCTCAAGTGGACGCGGCACGG
O_iheyensis TGTTCAAATT-ATTAT--TAATGGAGAAACAATCAATGGTAGAGGGTCTG
L_innocua TGTCGTCATTAAAGATGATAATGGCACTGAATTCCATGGTATCGGTATCG
L_monocytogenes_4b_F2365 TGTCGTCATTAAAAATGACAAAGGCGCTGTGTTCCATGGTATCGGTATTG
** * ** ** *
B_anthracis_Ames CACAAGACGTATTGGAAGCGTCGGCAAGAGCATATGTCCACGCAGCGGGG
B_thuringiensis_konkukian CACAAGACGTATTGGAAGCGTCGGCAAGAGCATATGTCCACGCAGCGGGG
B_cereus_ATCC14579 CGCAAGACGTATTGGAAGCGTCGGCGAGAGCTTATGTCCATGCAGCAGGG
B_licheniformis_ATCC_14580 CCCAGGATGTTCTCGAAGCATCTGCCAAAGCCTACTTGAATGCTGTGAAC
B_subtilis CGCAAGACGTATTAGAAGCATCAGCGAAAGCCTATTTGAACGCAGTAAAC
B_halodurans CCCATGATGTGTTAGAAGCGTCAGCAAAAGCCTATTTAAACGCAGTTAAT
O_iheyensis CTCAGGATGTGCTTCAAGCGTCTGCAGTAGCATTCATTCAGGCAGTAAAT
L_innocua ATTTTGATGTTTTAACAGCTAGTGCTAAAGCTTATTTGCAAGCATCTGGA
L_monocytogenes_4b_F2365 ATTTTGACGTTTTAACTGCTAGTGCTAAAGCTTATTTGCAAGCATCAGGC
** ** * ** ** *** * * * **
B_anthracis_Ames AAATTGAAATCTTTTATCCAGCTTGTGAAATAA-----------------
B_thuringiensis_konkukian AAATTGAAATCTTTTATCCAGCTTGTGAAATAA-----------------
B_cereus_ATCC14579 AAATTAAAATCTCTTATAGCGCTTGTGAAATAA-----------------
B_licheniformis_ATCC_14580 CGCTATCTCGTATTGAAGACGAATACGGAAGGATTGTCAAAACAGGCGGC
B_subtilis CGTCAATTGGTTTTCCAGTCGAATATGAGCGGATTGAAAAACCACACAGC
B_halodurans CG--AACGATTAATCGAAAGAAATATGCAGAAGTTTATGGATCGAA----
O_iheyensis CGTTATTACGT--TCAAAAGAAA-----GCAAATC--TAACTCGACTTGT
L_innocua AAAAGCAAATCTACAAGTAAGCA----AGCTGATTTCGAGGAGGTAAAGT
L_monocytogenes_4b_F2365 AAAAGCAAAACCGCCAGTAAGCA----AGCTGATTTCGAGGAGGTAAAAT
... truncated
What do you notice from the output shown above of the leuA-Bacillales.aln MSA?
The correct mode of aligning CDS requires their translation and alignment at the protein level.
Here we assume that all CDSs in the FASTA file are in the same+1 frame.
translate_fastas.pl -e fas -t 11
Lets visualize the translated products:
cat leuA-Bacillales_translated.faa
>B_anthracis_Ames 30261500 MKQILFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRI ANTIQNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLDS IYRSVTLGKSLFPTVQFSAEDATRTARDFLAEAVEVAIRAGANVINIPDTVGYTNPEEYY ALFKYLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAAL EEVAVALHIRKDFYKAEPSMTLKEIKATSTLVSRLTGMVVSKNKAIVGANAFAHESGIHQ DGVLKEVTTYEIIEPALVGESQNLFVLGKHSGRHAFTEKMKELGYEFTNDERDAVFEAFK KLADRKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVFED AATGSGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGAHQVSGFGVA QDVLEASARAYVHAAGKLKSFIQLVK* >B_cereus_ATCC14579 30019549 LFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRIANTI QNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLNSIHRS VTLGKSLFPTVQFSAEDATRTSRAFLAEAVEIAIRAGANVINIPDTVGYTNPEEYYSLFE YLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAALEEVA VALHIRKDFYKAEPSMTLKEIKATSILVSRLTGMVVPKNKAIVGANAFAHESGIHQDGVL KEVTTYEIIEPALIGESQNLFVLGKHSGRHAFTEKMKELGYEFTNEERDAAFEAFKKLAD RKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVYEDAATG SGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGSHQISGFGVAQDVL EASARAYVHAAGKLKSLIALVK* ... truncated
Note that the stop codons were translated as ’*’ symbols. These are not aminoacids and should be removed. This can be easily achieved with a simple \(sed\) one-liner:
# use sed to replace the terminal * symbols by nothing
sed 's/\*$//' Bacillales_translated.faa > k && mv k Bacillales_translated.faa
# visualize the edited translation products
cat leuA-Bacillales_translated.faa
>B_anthracis_Ames 30261500 MKQILFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRI ANTIQNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLDS IYRSVTLGKSLFPTVQFSAEDATRTARDFLAEAVEVAIRAGANVINIPDTVGYTNPEEYY ALFKYLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAAL EEVAVALHIRKDFYKAEPSMTLKEIKATSTLVSRLTGMVVSKNKAIVGANAFAHESGIHQ DGVLKEVTTYEIIEPALVGESQNLFVLGKHSGRHAFTEKMKELGYEFTNDERDAVFEAFK KLADRKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVFED AATGSGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGAHQVSGFGVA QDVLEASARAYVHAAGKLKSFIQLVK ... truncated
Note that it is critical to conserve the original ordering of the sequences from the input file in the output MSA using the \(-OUTORDER=input\) option.
clustalw -help
clustalw -infile=leuA-Bacillales_translated.faa -outfile=leuA-Bacillales_translated_clu.faa -output=fasta -OUTORDER=input
prot2cdnAlns.pl -h
prot2cdnAlns.pl leuA-Bacillales.fas leuA-Bacillales_translated_clu.faa
Let us display both the codon and default MSAs of the leuA CDSs:
seaview leuA-Bacillales_cdnaln.fas &
seaview leuA-Bacillales.aln &
The output shown above clearly demonstrates that when aligning CDSs, we should first translate them to proteins, perform the alignment of the translation products, and finally reconstitute the CDS alignment using the protein MSA as a guide.
Clustal\({\Omega}\) is the last incarnation of the Clustal family of MSA programs. It offers a significant increase in scalability over previous versions thanks to the implementation of novel algorithms that avoid bottlenecks of the classic progressive multiple sequence alignment algorithms (Sievers et al. 2011).
Clustalw and earlier Culstal versions use fast word based alignments for these comparisons which made it memory efficient and fast enough to work on PCs and Macintosh computers. However, as the number of sequences grows above a few thousand, the \(O(N^2)\) complexity associated with the computation of pairwise distance matrices to generate “guide trees” becomes very time consuming and makes very big alignments difficult to make.
\(Clustal\Omega\) implements an \(O(NlogN)\) method called \(mBed\) (Sievers et al. 2011) that allows guide trees of hundreds of thousands of sequences to be made by restricting the calculation of sequence alignment scores to NLog(N). This mBed method is what is used in Clustal Omega to give capacity and scalability for very large datasets.
The second main development in Clustal Omega was to use an alignment engine for aligning profile hidden Markov models (HMMs) to each other instead of the conventional dynamic programing and profile alignment. We used \(HHalign\) (Söding 2005) which had been shown to have very high accuracy for profile HMM alignment. This gives greatly increased accuracy to Clustal Omega when compared to earlier Clustal programs, as measured on structure based alignment benchmarks. Only a small amount of original code from the earlier Clustal programs was used for the new program: the fast word-based pairwise alignment routines. The rest of the code was coded new from scratch or taken from publically available libraries.
The paper describing Clustal\({\Omega}\) is the following one: Sievers et al. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011 Oct 11:7:539. doi: 10.1038/msb.2011.75
ABSTRACT Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
Here we will repeat the MSA of the twoFASTA files containing four and six bacterial and eukaryotic GDP protein sequences using clustalo called from a Bash for loop.
# call clustalo within a for loop to align all *ualn.faa files found in the working directory
for file in *ualn.faa; do clustalo -i $file -o ${file%.*}_cluOaln.phy --outfmt phy --output-order input-order; done
clustalo --profile1 4_GDP_procar_ualn_cluOaln.phy --profile2 6_GDP_eucar_ualn_cluOaln.phy --infmt phy --outfmt clu -o 10_GDP_eucar_prokar_cluOp2p.aln
clustalo --profile1 6_GDP_eucar_ualn_cluOaln.phy -i 4_GDP_procar_ualn.faa -o 10_GDP_eucar_prokar_cluOs2p.aln --outfmt clu
paste 10_GDP_eucar_prokar_cluOp2p.aln 10_GDP_eucar_prokar_cluOs2p.aln
Several differences can be seen between the p2p and s2p GDP alignments shown above. Which one is the best?
To gain evidence in favor of one or the other, we will perform a new \(clustalo\) alignment on the full 10 GDP sequence set using iterations and Kimura correction.
Iterations can be used in an attempt to improve the alignment by a process of repeated refinement (Sievers & Higgins, 2014). Iterative refinement can be accomplished in two ways: (i) guide-tree refinement and/or (ii) refinement using the multiple alignment output as an external profile HMM. This process is illustrated in the bottom part of Figure shown below (Sievers & Higgins, 2014).
The second iteration scheme uses a profile HMM, derived from an alignment of the input sequences, as an external profile. The first alignment gives an outline of the positions where residues are more likely to align in a second round of alignment.
Both iteration methods can be carried out a number of times, separately or in combination. The default is to combine them, as will be shown in the protocol below.
# 1. combine the source FASTA files into a single one named 10_GDP_procar_eukar_ualn.fas
cat 4_GDP_procar_ualn.faa 6_GDP_eucar_ualn.faa > 10_GDP_procar_eukar_ualn.fas
# 2. Perform clustalo MSA with two iterations and using Kimura-corrected distances
clustalo -i 10_GDP_procar_eukar_ualn.fas --iter 2 --use-kimura -o 10_GDP_eucar_prokar_cluoAln2iter.aln --outfmt clu
Clustal Omega will first construct a guide tree and align the sequences in the order specified by the guide tree. This alignment is then used in two ways:
During the second round of alignment (the first iteration), sequences are aligned to the internally constructed profile HMM, pseudo-count information is transferred to the unaligned sequences, and the ‘enhanced’ sequences are aligned to each other in the order specified by the newly constructed guide tree. The final alignment is written in FASTA format to outfile.fa .
The first iteration will be followed by another guide-tree construction and HMM pseudo-count transfer.
While the alignment accuracy is expected to be higher for the iterated alignments, so is also the response time. Every round of iteration can take up to three times longer than the initial alignment. For example, if the initial alignment took 1 min, then an alignment that was iterated twice might take up to 7 min; that is, 1 min for the initial alignment and then two times 3 min for the iterations.
Unfortunately, experience has shown that the alignment accuracy cannot be improved indefinitely by iteration. Good improvements are often achieved with one or two iterations (Sievers & Higgins, 2014).
Let us compare the standard and iterated alignments of the 10_GDP_eucar_prokar sequences:
paste 10_GDP_eucar_prokar_cluoAln2iter.aln 10_GDP_eucar_prokar_cluOp2p.aln
As shown above, there are minimal differences in the p2p vs. the iteration alignment with Kimura distances performed on the full set of 10 GDP sequences. Can you spot them?
Based on the high similarity of both alignments, we might conclude that the p2p or full alignment with two iterations are better than the s2p alignment presented in a previous section.
As discussed previously, CDSs should be aligned using their aligned translation products as a guide.
# 1. Translate the leuA-Bacillales.fas file with the aid of the translate_fastas.pl Perl script
translate_fastas.pl -e fas -t 11
# 2. Remove the trailing *'s resulting from the translation of stop codons present in the DNA sequences
sed 's/\*$//' Bacillales_translated.faa > k && mv k Bacillales_translated.faa
# 3. Align the tranlation products keeping the input order
clustalo -i leuA-Bacillales_translated.faa -o leuA-Bacillales_translated_cluo.faa --output-order input-order
# 4 Use the prot2cdnAlns.pl Perl script to generate the codon alignment using the protein alignment as a guide
prot2cdnAlns.pl leuA-Bacillales.fas leuA-Bacillales_translated_cluo.faa
Finally, we will visualize the resulting codon alignment of the leuA CDSs at the DNA and protein levels.
seaview leuA-Bacillales_cdnaln.fas &
As mentioned previously, a key feature of \(Clustal\Omega\) is that it can take an external profile hidden Markov model of a protein family and use it to align highly divergent members. As mentioned by Sievers & Higgins (2014), the main problem with progressive multiple sequence alignment is that alignments that were made during the early stages of the MSA cannot be changed at a later stage. This has been summarized by the phrase “once a gap, always a gap”.
Ideally one would like to have a degree of ‘foresight’ and know in advance which alignment columns are conserved and important in the alignment Sievers & Higgins (2014). A limited amount of foresight can be communicated to Clustal Omega if an alignment of sequences that are homologous and alignable to those in the input sequence set has already been constructed. Such alignment information can be conveniently described in the form of a profile hidden Markov model (profile HMM).
Use of profile HMMs is particularly natural within Clustal Omega, as its main alignment algorithm is based on alignment of profile HMMs (Söding, 2005). A profile HMM can be input into Clustal Omega as an ‘external profile’, which is used to guide the alignment of the unaligned sequences in the input file. The alignment of the unaligned residues will be guided by the distribution of residues and gaps in the HMM. Residues of the unaligned sequences will be aligned to the positions within the HMM that best fit the signature, and this is used to align those residues more accurately.
Experience has shown (Sievers et al., 2011) that external profile HMM information is rarely detrimental. Use of HMMs as external profiles is therefore very robust. However, its use is particularly helpful in the alignment of highly diverged sequences with low average pairwise sequence similarity. Such scenarios frequently lead to early misalignments and/or gaps, which can be partly avoided using background HMM information.
This procedure can be used if an existing multiple alignment of a set of sequences that has been carefully maintained is available. It can then be converted to a HMM and used to help align new sequences. Alternatively, HMMs can be downloaded from alignment databases such as PFAM. This procedure is also used to iterate the multiple alignment process, as described in the previous section.
As a final exercise, we will align the classic set of 146 human Ras GTPases compiled by Wennerberg et al. (2005) with the aid of PFAM00071.hmm model for the Ras superfamily fetched from PFAM. Note that on the chaac server the alignment takes ~20 minutes to complete using 10 cores.
time clustalo -i Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa --hmm-in Ras_fam_PF00071.hmm -o Human_Ras_Superfam_proteins_Wennerberg_JCS05_PF00071.aln --outfmt clu
real 28m5.611s <== almost half an hour! user 2656m6.083s sys 0m10.238s
Below is a screenshot of the central part of the resulting alignment, the region modeled by the PFAM00071 model, for a fraction of the 146 sequences, and viewed with seaview.
seaview Human_Ras_Superfam_proteins_Wennerberg_JCS05_PF00071.aln &
The alignment shown above was used to compute the maximum likelihood tree shown below for the family (Vinuesa, unpubl.)
Seaview can be used to easily extract certain alignment regions or columns. These can then be used to compute the sequence motif and logos that summarize the residue frequency distribution using ad hoc Perl and R scripts written by the author.
Below the logo and motif of the first highly conserved (G1) region of Ras GTPases
A simpler, less informative way to represent these highly conserved signatures is by a sequence motif, as shown below
# sequence motif for the highly conserved G1 region of Ras GTPases with max. degeneracy set to 4 xxxxxxxx[GAE]xxxx[GAV][KT][TSGK]xxxxxxxxx